Abstract

We introduce a simple, general framework for 
likelihood-free Bayesian reinforcement learn- 
ing, through Approximate Bayesian Compu- 
tation (ABC). The advantage is that we only 
require a prior distribution on a class of sim- 
ulators. This is useful when a probabilistic 
model of the underlying process is too com- 
plex to formulate, but where detailed simu- 
lation models are available. ABC-RL allows 
the use of any Bayesian reinforcement learn- 
ing technique in this case. It can be seen as 
an extension of simulation methods to both 
planning and inference. We experimentally 
demonstrate the potential of this approach 
in a comparison with LSPI. Finally, we introduce
a theorem showing that ABC is sound.

Bayesian reinforcement learning (Strens, 2000; Vlassis et al.) is the decision-theoretic approach (DeGroot, 1970) to solving the reinforcement learning problem. However, apart from the fact that calculating posterior distributions and the Bayes-optimal decision is frequently intractable (Duff, 2002; Ross et al., 2008), another major difficulty is the specification of the prior and model class. While there exist a number of non-parametric Bayesian model classes which can be brought to bear for estimation of the dynamics of an unknown process, it may not be a trivial matter to select the correct class and prior. On the other hand, it is frequently known that the process can be approximated well by a complex parametrised simulator. The question is how to take advantage of this knowledge when the best simulator parameters are not known.

We propose a simple, general reinforcement learning framework employing the principles of Approximate Bayesian Computation (ABC; see Csilléry et al., 2010, for an overview) for performing Bayesian inference using simulation. In doing so, we extend rollout algorithms for reinforcement learning, such as those described in (Bertsekas, 2006; Bertsekas & Tsitsiklis, 1996; Dimitrakakis & Lagoudakis, 2008; Lagoudakis & Parr, 2003a), to the case where we do not know what the correct model to draw rollouts from is.

We show how to use ABC to compute approximate posteriors over a set of environment models in the context of reinforcement learning. This includes a simple but general theoretical result on the quality of ABC posterior approximations. Finally, building on previous approaches to Bayesian reinforcement learning, we propose a strategy for selecting policies in this setting.

1.1. The setting

In the reinforcement learning problem, an agent is acting in some unknown environment $\mu$, according to some policy $\pi$. The agent's policy is a procedure for selecting a sequence of actions, with the action at time $t$ being $a_t \in \mathcal{A}$. The environment reacts to this sequence with a corresponding sequence of observations $x_t \in \mathcal{X}$ and rewards $r_t \in \mathbb{R}$. This interaction may depend on the complete history$^1$ $h \in \mathcal{H}$, where $\mathcal{H} = (\mathcal{X} \times \mathcal{A} \times \mathbb{R})^*$ is the set of all state-action-reward sequences, as neither the agent nor the environment is necessarily finite-order Markov. For example, the agent may learn, or the environment may be partially observable.

1 A history may include multiple trajectories in episodic environments.

In this paper, we use a number of shorthands to simplify notation. Firstly, we denote the (random) probability measure for the agent's action at time $t$ by:

$$\pi_t(A) \triangleq P^\pi(a_t \in A \mid x^t, r^t, a^{t-1}), \tag{1.1}$$

where $x^t$ is a shorthand for the sequence $(x_i)_{i=1}^{t}$; similarly, we use $x_k^t$ for $(x_i)_{i=k}^{t}$. We denote the environment's response at time $t+1$ given the history at time $t$ by:

$$\mu_t(B) \triangleq P_\mu\left((x_{t+1}, r_{t+1}) \in B \mid x^t, r^t, a^t\right). \tag{1.2}$$

In a further simplification, we shall also use $\pi_t(a_t)$ for the probability (or density) of the action actually taken by the policy at time $t$, and similarly, $\mu_t(x_t)$ for the realised observation. Finally, we use $P^\pi_\mu$ to denote joint distributions on action, observation and reward sequences under the environment $\mu$ and policy $\pi$.

The agent's goal is determined through its utility:

$$U \triangleq \sum_{t=1}^{\infty} \gamma^{t-1} r_t, \tag{1.3}$$

which is a discounted sum of the total instantaneous rewards obtained, with $\gamma \in [0, 1]$. Without loss of generality, we assume that $U \in [0, U_{\max}]$. The optimal policy maximises the expected utility $\mathbb{E}^\pi_\mu U$. As in the reinforcement learning problem the environment $\mu$ is unknown, this maximisation is ill-posed. Intuitively, we can increase the expected utility by either: (i) trying to better estimate $\mu$ in order to perform the maximisation later (exploration), or (ii) using a best-guess estimate of $\mu$ to obtain high rewards (exploitation).
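As a concrete illustration of the utility in (1.3), the following minimal sketch computes the discounted return of a finite reward sequence. The function name `discounted_utility` is ours, and the sketch assumes a finite trajectory whose first reward is left undiscounted; it is not taken from the original implementation.

```python
def discounted_utility(rewards, gamma):
    """Return sum_t gamma^(t-1) * r_t for a finite reward sequence r_1, ..., r_T."""
    utility = 0.0
    # Accumulate from the last reward backwards (Horner's scheme), so that
    # each earlier reward sees the later ones discounted once more.
    for reward in reversed(rewards):
        utility = reward + gamma * utility
    return utility


# Three steps with reward -1 each and discount factor 0.5.
print(discounted_utility([-1.0, -1.0, -1.0], 0.5))  # -1.75
```

With `gamma = 1` the function reduces to the plain sum of rewards, which matches the undiscounted limit allowed by the range of the discount factor.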

In order to solve this trade-off, we can adopt a Bayesian viewpoint (DeGroot, 1970; Savage, 1972), where we consider a (potentially infinite) set of environment models $\mathcal{M}$. In particular, we select a prior probability measure $\xi$ on $\mathcal{M}$. For an appropriate subset $B \subset \mathcal{M}$, the quantity $\xi(B)$ describes our initial belief that the correct model lies in $B$. We can now formulate the alternative goal of maximising the expected utility with respect to our prior:

$$\mathbb{E}^\pi_\xi U = \int_{\mathcal{M}} \left(\mathbb{E}^\pi_\mu U\right) d\xi(\mu). \tag{1.4}$$

We can now formalise the problem as finding a policy $\pi^*_\xi \in \arg\max_\pi \mathbb{E}^\pi_\xi U$. Any such policy is Bayes-optimal, as it solves the exploration-exploitation problem with respect to our prior belief.

1.2. Related work and our contribution 



The first difficulty when adopting a Bayesian approach to sequential decision making is that finding the policy maximising (1.4) is hard (Duff, 2002), even in restricted classes of policies (Dimitrakakis, 2011). On the other hand, simple heuristics such as Thompson sampling (Strens, 2000; Thompson, 1933) provide an efficient trade-off (Agrawal & Goyal, 2012; Kaufmann et al., 2012) between exploration and exploitation. Although other heuristics exist (Araya et al., 2012; Castro & Precup, 2007; Kolter & Ng, 2009; Poupart et al., 2006; Strens, 2000), in this paper we focus on an approximate version of Thompson sampling for reasons of simplicity. The second difficulty is that in many interesting problems, the exact posterior calculation may be intractable, mainly due to partial observability (Poupart & Vlassis, 2008; Ross et al., 2008). Interestingly, an ABC approach would not suffer from this problem for reasons that will be made clear in the sequel.

The most fundamental difficulty in a Bayesian frame- 
work is specifying a generative model class: it is not 
always clear what is the best model to use for an 
application. However, frequently we have access to 
a class of parametrised simulators for the problem. 
Therefore, one reasonable approach is to find a good 
policy for a simulator in the class, and then apply 
it to the actual problem. Methods for finding good 
policies using simulation have been extensively studied
before (Bertsekas, 2006; Bertsekas & Tsitsiklis, 1996;
Dimitrakakis & Lagoudakis, 2008; Gabillon et al.,
2011; Wu et al., 2010). However, in all those cases
simulation was performed on a simulator with fixed 
parameters. 

Approximate Bayesian Computation (ABC) (see Csilléry et al., 2010; Marin et al., 2011, for an overview) is a general framework for likelihood-free Bayesian inference via simulation. It has been developed because of the existence of applications, such as econometric modelling (e.g. Geweke, 1999), where detailed simulators were available, but no useful analytical probabilistic models. While ABC methods have also been used for inference in dynamical systems (e.g. Toni et al., 2009), they have not yet been applied to the reinforcement learning problem.

This paper proposes to perform Bayesian reinforce- 
ment learning through ABC on an arbitrary class of 
parametrised simulators. As ABC has been widely 
used in applications characterised by large amounts 
of data and complex simulations with many unknown 
parameters, it may also scale well in reinforcement 
learning applications. The proposed methodology is 
generally applicable to arbitrary problems, including 
partially observable environments, continuous state 
spaces, and stochastic Markov games.

ABC Reinforcement Learning generalises methods previously developed for simulation-based approximation of optimal policies to the Bayesian case. While in the standard framework covered by Bertsekas (1999) a particular simulator of the environment is assumed to exist, via ABC we can relax this assumption. We only need a class of parametrised simulators that contains one close to the real environment dynamics. Thus, the only remaining difficulty is computational complexity.

Finally, we provide a simple but general bound for ABC posterior computation. This bounds the KL divergence between the approximate posterior computed via ABC and the complete posterior distribution. As far as we know, this is a new and widely applicable result, although some other theoretical results using similar assumptions appear in (Jasra et al., 2010) and in (Dean & Singh, 2011) for hidden Markov models.



Section 2 introduces ABC inference for reinforcement learning, discusses its difference from standard Bayesian inference, and presents a theorem on the quality of the ABC approximation. Section 3 describes the ABC-RL framework and the ABC-LSPI algorithm for continuous state spaces. An experimental illustration is given in Sec. 4, followed by a discussion in Sec. 5. The appendix contains the collected proofs.

2. Approximate Bayesian Computation 

Approximate Bayesian Computation encompasses a 
number of likelihood-free techniques where only an
approximate posterior is calculated via simulation. We
first discuss how standard Bayesian inference in rein- 
forcement learning differs from ABC inference. We 
then introduce a theorem on the quality of the ABC 
approximation. 

2.1. Bayesian inference for reinforcement 
learning 

Imagine that the history $h \in \mathcal{H}$ has been generated from a process $\mu \in \mathcal{M}$ controlled with a history-dependent policy $\pi$, something which we denote as $h \sim P^\pi_\mu$. Now consider a prior $\xi$ on $\mathcal{M}$ with the property that $\xi(\cdot \mid \pi) = \xi(\cdot)$, i.e. that the prior is independent of the policy used. Then the posterior probability, given a history $h$ generated by a policy $\pi$, that $\mu \in B$ can be written as:$^2$

$$\xi(B \mid h, \pi) = \frac{\int_B P^\pi_\mu(h) \, d\xi(\mu)}{\int_{\mathcal{M}} P^\pi_\mu(h) \, d\xi(\mu)}. \tag{2.1}$$

Fortunately, the dependence on the policy can be re- 
moved, since the posterior is the same for all policies 
that put non-zero mass on the observed data: 

Remark 2.1. Let $h \sim P^\pi_\mu$. Then $\forall \pi' \neq \pi$ such that $P^{\pi'}_\mu(h) > 0$, $\xi(B \mid h, \pi) = \xi(B \mid h, \pi')$.

Consequently, when calculating posteriors, the policy employed need not be considered, even when the process and policy depend on the complete history. In the ABC setting, however, we do not have direct access to the probabilities $\mu_t$ for the models $\mu$ in our model class $\mathcal{M}$. Nevertheless, we can always generate observations from any model: $x_{t+1} \sim \mu_t$. This idea is used by ABC to calculate approximate posterior distributions.
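The policy-independence of the exact posterior can be checked numerically on a small finite model set. The sketch below is only an illustration under our own assumptions: the two models, their per-step observation probabilities and the two policies are hypothetical numbers, and the history likelihood is taken to factorise into per-step model and policy terms.

```python
def history_likelihood(obs_probs, action_probs):
    """Probability of a history: product over steps of the model's probability
    of the realised observation times the policy's probability of the action."""
    likelihood = 1.0
    for mu_t, pi_t in zip(obs_probs, action_probs):
        likelihood *= mu_t * pi_t
    return likelihood


def posterior(prior, likelihoods):
    """Bayes' rule on a finite model set."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    normaliser = sum(joint)
    return [j / normaliser for j in joint]


# Hypothetical per-step probabilities that two models assign to the same history.
model_obs_probs = [[0.9, 0.8, 0.7], [0.2, 0.5, 0.6]]
prior = [0.5, 0.5]

# Two different policies, both giving the observed actions non-zero probability.
policy_a = [0.5, 0.5, 0.5]
policy_b = [0.9, 0.1, 0.3]

post_a = posterior(prior, [history_likelihood(m, policy_a) for m in model_obs_probs])
post_b = posterior(prior, [history_likelihood(m, policy_b) for m in model_obs_probs])
print(post_a, post_b)  # the two posteriors agree up to floating-point rounding
```

The policy terms multiply every model's likelihood by the same constant, so they cancel in the normalisation; this is the cancellation that the remark relies on.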

2.2. ABC inference for reinforcement learning 

The main idea of ABC is to approximate samples from the posterior distribution via simulation. We produce a sequence of sample models $\mu^{(k)}$ from the prior $\xi$, and then generate data $h^{(k)}$ from each. If the generated data is "sufficiently close" to the history $h$, then the $k$-th model is accepted as a sample from the posterior $\xi(\mu \mid h)$. More specifically, ABC requires that we define an approximately sufficient statistic $f : \mathcal{H} \to \mathcal{W}$ on some normed vector space $(\mathcal{W}, \|\cdot\|)$. If $\|f(h) - f(h^{(k)})\| \le \epsilon$ then $\mu^{(k)}$ is accepted as a sample from the posterior. Algorithm 1 gives the sampling method in detail for reinforcement learning. An important difference with the standard ABC posterior approximation, as well as exact inference, is the dependency on $\pi$.

Note that even though Remark 2.1 declares that the posterior is independent of the policy used, when using ABC this is no longer true. We must maintain the complete policy used until then to generate samples, otherwise there is no way to generate a sequence of observations.$^3$ Intuitively, the algorithm can be seen as generating rollouts from a number of simulators, sampled from our prior distribution. The sampled set of simulators with a sufficiently close statistic is then an approximate sample from our posterior distribution. The first question is what types of statistics we need.

In fact, just as in standard ABC, if the statistic is 
sufficient, then the samples will be generated according 
to the posterior. 

Corollary 2.1. If $f$ is a sufficient statistic, then the set $\hat{\mathcal{M}}$ returned by Alg. 1 for $\epsilon = 0$ is a sample from the posterior.



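2 For finite $\mathcal{M}$, the posterior simplifies to $\xi(\mu \mid h, \pi) = P^\pi_\mu(h)\,\xi(\mu) \big/ \sum_{\mu' \in \mathcal{M}} P^\pi_{\mu'}(h)\,\xi(\mu')$.

3 For episodic problems, we must maintain the sequence of policies used.

Algorithm 1 ABC-RL-Sample

input Prior $\xi$ on $\mathcal{M}$, history $h \in \mathcal{H}$, threshold $\epsilon$, statistic $f : \mathcal{H} \to \mathcal{W}$, policy $\pi$, maximum number of samples $N_{\mathrm{sam}}$, stopping condition $\tau$.
$\hat{\mathcal{M}} = \emptyset$.
for $k = 1, \ldots, N_{\mathrm{sam}}$ do
    $\mu^{(k)} \sim \xi$
    $h^{(k)} \sim P^\pi_{\mu^{(k)}}$
    if $\|f(h) - f(h^{(k)})\| \le \epsilon$ then
        $\hat{\mathcal{M}} := \hat{\mathcal{M}} \cup \{\mu^{(k)}\}$.
    end if
    if $\tau$ then
        break
    end if
end for
return $\hat{\mathcal{M}}$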
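The sampling loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function `abc_rl_sample`, the toy two-armed environment, the uniform policy and the mean-reward statistic are all hypothetical choices made so that the example is self-contained, and the statistic is scalar so that the norm reduces to an absolute value.

```python
import random


def abc_rl_sample(sample_prior, simulate, statistic, history, policy,
                  epsilon, n_samples, stop=None, rng=None):
    """Rejection-style ABC: keep prior draws whose simulated statistic is
    within epsilon of the statistic of the observed history."""
    if rng is None:
        rng = random.Random(0)
    target = statistic(history)
    accepted = []
    for _ in range(n_samples):
        model = sample_prior(rng)                # draw a simulator from the prior
        simulated = simulate(model, policy, rng) # roll out the SAME policy in it
        if abs(statistic(simulated) - target) <= epsilon:
            accepted.append(model)
        if stop is not None and stop(accepted):  # optional stopping condition
            break
    return accepted


# --- Toy problem (hypothetical): arm 1 pays with probability theta, arm 0 with 0.5.
def sample_prior(rng):
    return rng.uniform(0.0, 1.0)


def uniform_policy(rng):
    return rng.randrange(2)


def simulate(theta, policy, rng, horizon=200):
    history = []
    for _ in range(horizon):
        action = policy(rng)
        p = theta if action == 1 else 0.5
        reward = 1.0 if rng.random() < p else 0.0
        history.append((action, reward))
    return history


def mean_reward(history):
    return sum(reward for _, reward in history) / len(history)


rng = random.Random(1)
observed = simulate(0.8, uniform_policy, rng)  # data from the "real" environment
models = abc_rl_sample(sample_prior, simulate, mean_reward, observed,
                       uniform_policy, epsilon=0.05, n_samples=500, rng=rng)
print(len(models), "models accepted")
```

Note that the policy is passed to the simulator explicitly: as discussed above, the ABC sampler needs the data-collecting policy in order to generate comparable histories, even though the exact posterior does not depend on it.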



The (standard) proof is deferred to the appendix. Thus, for $\epsilon = 0$, when the statistic is sufficient, the sampling distribution and the posterior are identical. However, things are not so clear when $\epsilon > 0$.

We now provide a simple theorem which characterises the relation of the approximate posterior to the true posterior, when we use a (not necessarily sufficient) statistic with threshold $\epsilon > 0$. First, we recall the definition of the KL-divergence.

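Definition 2.1. The KL-divergence $D$ between two probability measures $\xi$, $\xi'$ on $\mathcal{M}$ is

$$D(\xi \,\|\, \xi') \triangleq \int_{\mathcal{M}} \ln \frac{d\xi(\mu)}{d\xi'(\mu)} \, d\xi(\mu). \tag{2.2}$$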
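For a finite model set the integral in the definition becomes a sum over models. The following sketch, with the hypothetical helper name `kl_divergence`, illustrates this special case; it assumes both arguments are probability vectors over the same finite set of models.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) between two distributions on a finite set."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue            # terms with zero mass under p contribute nothing
        if q_i == 0.0:
            return math.inf     # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


exact = [0.7, 0.2, 0.1]        # e.g. an exact posterior over three models
approx = [0.6, 0.3, 0.1]       # e.g. an approximate posterior
print(kl_divergence(exact, approx))
```

The divergence is zero when the two distributions coincide and infinite when the second one assigns zero probability to a model that the first one supports.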



In order to prove meaningful results, we need some additional assumptions on the likelihood function. In this particular case, we simply assume that it is smooth (Lipschitz) with respect to the statistical distance:

Assumption 2.1. For a given policy $\pi$, for any $\mu$ and histories $x, h \in \mathcal{H}$, there exists $L > 0$ such that $\left|\ln\left[P^\pi_\mu(h) / P^\pi_\mu(x)\right]\right| \le L \|f(h) - f(x)\|$.

We note in passing that this assumption is related to the notion of differential privacy (Dwork & Lei, 2009), from which it was inspired.

We now can state the following theorem, whose proof 
can be found in the appendix, which generalises the 
previous corollary. 

Theorem 2.1. Under a policy $\pi$ and statistic $f$ satisfying Assumption 2.1, the approximate posterior distribution $\xi_\epsilon(\cdot \mid h)$ satisfies:

$$D\left(\xi(\cdot \mid h) \,\|\, \xi_\epsilon(\cdot \mid h)\right) \le \left(1 + \ln |A^h_\epsilon|\right) L \epsilon, \tag{2.3}$$

where $A^h_\epsilon \triangleq \{ z \in \mathcal{H} \mid \|f(z) - f(h)\| \le \epsilon \}$ is the $\epsilon$-ball around the observed history $h$ with respect to the statistical distance and $|A^h_\epsilon|$ denotes its size.$^4$

4 For discrete observations this is simply the counting measure of the ball. For more general cases it can be extended to an appropriate measure.

The divergence depends on the statistic in the following ways. Firstly, it approaches 0 as $\epsilon \to 0$. Secondly, it is smaller for smoother likelihoods. However, because of the dependence on the size of the $\epsilon$-ball around the observed statistic, the statistic cannot be arbitrarily smooth. Nevertheless, it may be the case that a sufficient statistic is not required for good performance. Since in reinforcement learning we are mainly interested in the utility rather than in system identification, we may be able to get good results by using utility-related statistics.

Observation-based statistics A simple idea is to select features on which to calculate statistics. Discounted cumulative feature expectations are especially interesting, due to their connection with value functions (e.g. ?, Sec. 6.9.2). The main drawback is that this adds yet another hyper-parameter to tune. In addition, unlike econometrics or bioinformatics, we may not be interested in model identification per se, but only in finding a good policy.

Utility-based statistics Quantities related to the utility may be a good match for reinforcement learning. In the simplest case, it may be sufficient to only consider unconditional moments of the utility, which is the approach followed in this paper. However, these may only trivially satisfy Ass. 2.1 for arbitrary policies. Nevertheless, as we shall see, even a very simple such statistic has a reasonably good performance.

2.3. A Hoeffding-based utility statistic

In particular, given a history $h$ including $N_{\mathrm{dat}}$ trajectories in the environment, with the $i$-th trajectory obtaining utility $U^{(i)}$, we obtain a mean estimate $\hat{\mathbb{E}}_{\mathrm{dat}} U \triangleq \frac{1}{N_{\mathrm{dat}}} \sum_{i=1}^{N_{\mathrm{dat}}} U^{(i)}$. We then obtain a history $h^{(k)}$ containing $N_{\mathrm{trj}}$ trajectories from the sampled environment $\mu^{(k)}$ and construct the mean estimate $\hat{\mathbb{E}}_{k} U$. In order to test whether these are close enough, we use the Hoeffding inequality (Hoeffding, 1963). In fact, it is easy to see that, with probability at least $1 - \delta$, $\left|\mathbb{E}^\pi_\mu U - \mathbb{E}^\pi_{\mu^{(k)}} U\right|$ is lower bounded by:

$$\left|\hat{\mathbb{E}}_{\mathrm{dat}} U - \hat{\mathbb{E}}_{k} U\right| - U_{\max} \sqrt{\frac{\ln(2/\delta)\,(N_{\mathrm{dat}} + N_{\mathrm{trj}})}{2 N_{\mathrm{dat}} N_{\mathrm{trj}}}}, \tag{2.4}$$

where $U_{\max}$ is the range of the utility function. We then use (2.4) as the statistical distance $\|f(h) - f(h^{(k)})\|$ between the observed history $h$ and the sampled history $h^{(k)}$. The advantage of using this statistic is that the more data we have, the harder it becomes to accept a sample.

This statistic has two parameters. Firstly, the error probability $\delta$, which does not need to be very small in practice, as the Hoeffding bound is only tight for high-variance distributions. The second parameter is $N_{\mathrm{trj}}$. This does not need to be very large, since it only makes a marginal difference in the bound when $N_{\mathrm{trj}} \gg N_{\mathrm{dat}}$. An illustration of the type of samples obtained with this statistic is given in Figure 1, which shows the dependency of the approximate posterior distribution on the threshold $\epsilon$ when conditioned on a fixed amount $N_{\mathrm{dat}}$ of training trajectories.
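The utility-based distance of this section can be sketched as below. This is an illustration under stated assumptions rather than the original code: the function name `hoeffding_distance` is ours, and we assume the slack term is the square root of the usual two-sample Hoeffding expression scaled by the utility range.

```python
import math


def hoeffding_distance(data_utilities, sim_utilities, u_max, delta):
    """Lower bound (holding with probability at least 1 - delta) on the gap
    between the expected utilities of the real and the sampled environment."""
    n_dat, n_trj = len(data_utilities), len(sim_utilities)
    mean_dat = sum(data_utilities) / n_dat
    mean_sim = sum(sim_utilities) / n_trj
    # Assumed form of the confidence slack: it shrinks as either sample grows.
    slack = u_max * math.sqrt(
        math.log(2.0 / delta) * (n_dat + n_trj) / (2.0 * n_dat * n_trj)
    )
    return abs(mean_dat - mean_sim) - slack


# The same gap in mean utility yields a larger distance when more data is available.
few = hoeffding_distance([0.2] * 10, [0.6] * 10, u_max=1.0, delta=0.5)
many = hoeffding_distance([0.2] * 1000, [0.6] * 1000, u_max=1.0, delta=0.5)
print(few, many)
```

A sampled model would then be accepted when this distance does not exceed the threshold. With few trajectories the slack dominates and almost any model passes; with many trajectories only models with similar mean utility do, which is the behaviour described above.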

3. ABC reinforcement learning 

We now present a simple algorithm for ABC reinforce- 
ment learning, based on the ideas explained in the pre- 
vious section. For any given set of observations and 
policies, we draw a number of sample environments 
from the prior distribution. For each environment, we 
execute the relevant policy and calculate the appropri- 
ate statistics. If these are close enough to the observed 
statistic, the sample is accepted. The next step is to 
find a good policy for the sampled simulator. As we 
can draw an arbitrary number of rollouts in the simu- 
lator, any type of approximate dynamic programming 
algorithm can be used. In our experiments, we used
LSPI (Lagoudakis & Parr, 2003b), which is simple to
program and effective. The hope is that if the approx- 
imate posterior sampling is reasonable, then we can 
take advantage of our prior knowledge of the environ- 
ment class, to learn a good policy with less data, at 
the expense of additional computation. 

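Algorithm 2 ABC-RL
parameters $\mathcal{M}$, $\xi$, $h$, $\pi$, $f$
    $\tau = \{ |\hat{\mathcal{M}}| = 1 \}$
    $\hat{\mathcal{M}}$ = ABC-RL-Sample($\mathcal{M}$, $\xi$, $h$, $\pi$, $f$, $\tau$)
    return $\hat{\pi} \approx \arg\max_\pi \mathbb{E}^\pi_{\hat{\mu}} U$, with $\hat{\mu} \in \hat{\mathcal{M}}$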



A sketch of the algorithm is shown in Alg. 2. This has a number of additional parameters that need to be discussed. The most important is the stopping condition $\tau$. The simplest idea, which we use in this paper, is to stop when a single model $\hat{\mu}$ has been generated by ABC-RL-Sample.

Then an (approximate) optimal policy for the sampled model $\hat{\mu}$ can be found via an exact (or approximate) dynamic programming algorithm. This simplifies the optimisation step significantly, as otherwise it would be necessary to optimise over multiple models. This particular version of the algorithm can be seen as an ABC variant of Thompson sampling (Strens, 2000; Thompson, 1933).



The exact algorithm to use for the policy optimisation 
depends largely upon the class of simulators we have. 
In principle any type of environment can be handled, 
as long as a simulation-based approximation method 
can be used to discover a good policy. In extremis, 
direct policy search may be used. However, in the 
work presented in this paper, we limit ourselves to 
continuous-state Markov decision processes, for which 
numerous efficient ADP algorithms exist. 

3.1. ABC-LSPI 

Let us consider the class of continuous-state, discrete-action Markov decision processes (MDPs). Then, a number of sample-based ADP algorithms can be used to find good policies, such as fitted Q-iteration (FQI) (Ernst et al., 2005) and least-squares policy iteration (LSPI) (Lagoudakis & Parr, 2003b), which we use herein.

Since we take an arbitrary number of trajectories from the sampled MDP, an important algorithmic parameter is the number of rollouts $N_{\mathrm{rol}}$ to draw. Higher values lead to better approximations, at the expense of additional computation. Finally, since LSPI uses a linear value function$^5$ approximation, it is necessary to select an appropriate basis for the fit to be good.

The computational complexity of ABC-LSPI depends on the quality of approximation we wish to achieve and on the number of samples required to sample a model with statistics $\epsilon$-close to those of the data. To reduce computation, if $N_{\mathrm{sam}}$ models have been generated without one being accepted, we double $\epsilon$ and call ABC-RL-Sample again.
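The threshold-doubling strategy can be sketched as follows. The helper `sample_with_doubling`, its `max_doublings` safety limit and the one-dimensional toy distance are our own assumptions for illustration; in the full method the distance would be computed from rollouts of the data-collecting policy in the sampled simulator.

```python
import random


def sample_with_doubling(sample_prior, distance, epsilon, n_sam, rng,
                         max_doublings=60):
    """Return the first prior draw within the threshold, doubling the
    threshold whenever n_sam consecutive draws have all been rejected."""
    for _ in range(max_doublings + 1):
        for _ in range(n_sam):
            model = sample_prior(rng)
            if distance(model, rng) <= epsilon:
                return model, epsilon
        epsilon *= 2.0  # no model accepted in this batch: relax the threshold
    raise RuntimeError("no model accepted; check the prior and the distance")


rng = random.Random(0)
model, eps_used = sample_with_doubling(
    sample_prior=lambda r: r.uniform(0.0, 1.0),   # toy prior on one parameter
    distance=lambda m, r: abs(m - 0.3),           # toy stand-in for the statistic gap
    epsilon=1e-6,
    n_sam=10,
    rng=rng,
)
print(model, eps_used)
```

The accepted model would then be handed to a planner such as LSPI. Starting from a deliberately small threshold trades extra sampling rounds for a tighter match whenever one is achievable within the sampling budget.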

4. Experiments 

We performed some experiments to investigate the viability of ABC-RL, with all algorithms implemented using (?). In these, we compared ABC-LSPI to LSPI. The intuition is that, if ABC can find a good simulator, then we can perform a much better estimation of the value function by drawing a large number of samples from the simulator, rather than estimating the value function directly from the observations.

5 The value function $V(s)$ is simply the expected utility conditioned on the system state $s$. We omit details as this is not necessary to understand the framework proposed.

Figure 1. Pendulum value distribution. In both cases, $N_{\mathrm{sam}} = 10$ model samples are drawn from the prior and $N_{\mathrm{rol}} = 10^3$ rollouts are performed for each model sample. The vertical dashed line shows the actual value of the policy. The solid and dot-dashed lines show the histograms of real and estimated values of the original policy in the sampled environment. The solid line shows the value estimated using $10^4$ rollouts. The dot-dashed line shows the value estimated in the run itself, with $N_{\mathrm{trj}}$ rollouts per sample. The x shows the expected value, averaged over the accepted samples. It can be seen that, while a smaller threshold can result in better accuracy, many fewer samples are accepted.

4.1. Domains 

We consider two domains to illustrate ABC-RL. In both of these domains, we have access to a set of parametrised simulators $\mathcal{M} = \{ \mu_\theta \mid \theta \in \Theta \}$ for the domains. However, we do not know the true parameters $\theta^* \in \Theta$ of the domains. For ABC, sampled parameters are drawn from a uniform distribution $\mathrm{Unif}(\Theta)$, with $\Theta = \{ \theta \in \mathbb{R}^n \mid \theta_i \in [\tfrac{1}{2}\theta^*_i, \tfrac{3}{2}\theta^*_i] \}$.
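Drawing simulator parameters from such a box-shaped uniform prior can be sketched as below. The helper `sample_parameters` is hypothetical, and the scaling factors 0.5 and 1.5 reflect our reading of the interval bounds, so they should be treated as an assumption. For negative true values the scaled bounds swap order, which the sketch handles by sorting them.

```python
import random


def sample_parameters(theta_star, rng, low=0.5, high=1.5):
    """Draw each parameter uniformly between low and high times its true value."""
    sample = []
    for value in theta_star:
        a, b = low * value, high * value
        # For negative values a > b, so order the bounds before sampling.
        sample.append(rng.uniform(min(a, b), max(a, b)))
    return sample


# True mountain car parameters as listed in the domain description.
theta_star = (0.5, -1.2, 0.07, -0.07, 0.001, 0.0025, 0.2)
theta = sample_parameters(theta_star, random.Random(0))
print(theta)
```

Each call yields one candidate simulator parametrisation; in a rejection-style ABC sampler this function would play the role of the prior draw.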

Mountain car This is a generalised version of the mountain car domain described in Sutton & Barto (1998). The goal is to bring a car to the top of a hill. The problem has 7 parameters: upper and lower bounds on the horizontal position of the car, upper and lower bounds on the car's velocity, upper bounds on the car's forwards and backwards acceleration power, and finally the amount of uniform noise present. The real environment parameters are $\theta^* = (0.5, -1.2, 0.07, -0.07, 0.001, 0.0025, 0.2)$. In this problem, the goal is to reach the right-most horizontal position. The observation consists of the horizontal position and velocity and the reward is $-1$ at every step until the goal is reached.

Pendulum This is a generalised version of the pendulum domain (Sutton & Barto, 1998), but without boundaries. The goal of the agent in this environment is to maintain a pendulum upright, using a controller that can switch actions every 0.1 s. The problem has 6 parameters: the pendulum mass, the cart mass, the pendulum length, the gravity, the amount of uniform noise, and the simulation time interval. In this environment, the reward is $+1$ for every step where the pendulum is balanced. The actual environment parameters are $\theta^* = (2.0, 8.0, 0.5, 9.8, 0.01, 0.01)$.

4.2. Results 

We compared the offline performance of LSPI and ABC-LSPI on the two domains. We first observe $N_{\mathrm{dat}}$ trajectories in the real environment drawn using a uniformly random policy. These trajectories are used by both ABC-LSPI and LSPI to estimate a policy. This policy is then evaluated over $10^3$ trajectories. The experiment was repeated for $10^2$ runs. Since LSPI requires a basis, in both cases we employed a uniform $4 \times 4$ grid of RBFs, as well as an additional unit basis for the value function estimation.

The results of the experiment are shown in Fig. 2, where we plot the expected utility (with a discount factor $\gamma = 0.99$) of the policy found as the number of trajectories increases. Both LSPI and ABC-LSPI manage to find an improved policy with more data. However, the source of their improvement is different. In the case of LSPI, the additional data leads to better estimation of the value function. In ABC-LSPI, the additional data leads to a better sampled model. The value function is then estimated using a large number of rollouts in the sampled model. The CPU time taken by ABC ranges from 20 to 40 s, versus 0.05 to 30 s for pure LSPI, depending on the amount of training data. This is due to the additional overhead of sampling as well as the increased number of rollouts used for ADP.

In general, the ABC approach quickly reaches a good performance, but then has little improvement. This effect is particularly prominent in the Mountain Car domain (Fig. 2(a)), where it is significantly worse asymptotically than LSPI. This can be attributed to the fact that even though more data is available, the number of samples drawn from the prior is not sufficient for a good model to be found. In fact, upon investigation we noticed that although most model parameters were reliably estimated, there was a difficulty in estimating the goal location from the given trajectories. This was probably the main reason why ABC did not reach optimal performance in this case. However, it may be possible to improve upon this result with a more efficient sampling scheme, or a statistic that is closer to sufficiency than the simple utility-based statistic we used.

On the other hand, the performance is significantly better than LSPI in the pendulum environment (Fig. 2(b)). There are two possible reasons for this. Firstly, ABC-LSPI not only uses more samples for the value function estimation, but also better distributed samples, as it estimates the value function by drawing trajectories starting from uniformly drawn states in the sampled environment. Secondly, and perhaps more importantly, even for very differently parametrised pendulum problems the optimal policies on the pendulum domain are quite similar. Thus, even if ABC only samples a very approximate simulator, its optimal policy is going to be close to that of the real environment.

Figure 2. Off-line performance for $N_{\mathrm{sam}} = 10^3$, $\epsilon = 10^{-2}$, $N_{\mathrm{trj}} = 10^2$, $N_{\mathrm{rol}} = 2 \cdot 10^3$, $\gamma = 0.99$. Panels: (a) Mountain Car, (b) Pendulum; the horizontal axis shows the number of trajectories. The data are averaged over 10 runs, with each run being evaluated with $10^3$ trajectories. The shaded regions show 95% bootstrap confidence intervals from 10 bootstrap samples.

5. Conclusion 

We presented an extension of ABC, a likelihood-free 
method for approximate Bayesian computation, to 
controlled dynamical systems. This method is par- 
ticularly interesting for domains where it is difficult to 
specify an appropriate probabilistic model, and where 
computation is significantly cheaper than data collec- 
tion. It is in principle generally applicable to any type 
of reinforcement learning problem, including continu- 
ous, partially observable and multi-agent domains. We 
also introduce a general theorem for the quality of the 
approximate ABC posterior distribution, which can be 
used for further analysis of ABC methods. 

We then applied ABC inference to reinforcement learn- 
ing. This involves using simulation both to estimate 
approximate posterior distributions and to find good 
policies. Thus, ABC-RL can be simultaneously seen 
as an extension of ABC inference to control problems 
and an extension of approximate dynamic program- 
ming methods to likelihood-free approximate Bayesian 
inference. The main advantage arises when we have no
reasonable probabilistic model, but we do have access to
a parametrised set of simulators, which contains good
approximations to the real environment. This is
frequently the case in complex control problems.
However, we see that ABC-RL (specifically ABC-LSPI)
is competitive with pure LSPI even in problems with 
low dimensionality where LSPI is expected to perform 
quite well. 

ABC-RL appears to be a viable approach, even with a very simple sampling scheme and a utility-based statistic. In future work, we would like to investigate more elaborate ABC schemes such as Markov chain Monte Carlo, as well as statistics that are closer to sufficient, such as discounted feature expectations and conditional utilities. This would enable us to examine its performance in more complex problems where the practical advantages of ABC would be more evident. However, we believe that the results are extremely encouraging and that the ABC methodology has great potential in the field of reinforcement learning.

A. Collected proofs 

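Proof of Remark 2.1. Let $h = (x^{T+1}, a^T, r^T)$. Using induction,

$$P^\pi_\mu(h) = \prod_{t=0}^{T} \mu_t(x_{t+1}) \, \pi_t(a_t).$$

Replacing in the posterior calculation (A.1) we obtain:

$$\xi(B \mid h, \pi) = \frac{\int_B \prod_{t=0}^{T} \mu_t(x_{t+1}) \, d\xi(\mu)}{\int_{\mathcal{M}} \prod_{t=0}^{T} \mu_t(x_{t+1}) \, d\xi(\mu)}, \tag{A.1}$$

since the $\prod_{t=0}^{T} \pi_t(a_t)$ terms can be taken out of the integrals and cancel out. □

Proof of Corollary 2.1. By definition, a sufficient statistic $f : \mathcal{H} \to \mathcal{W}$ has the following property:

$$\forall \mu, \pi : \quad P^\pi_\mu(h) = P^\pi_\mu(h') \text{ iff } f(h) = f(h'). \tag{A.2}$$

The probability of drawing a model in $B \subset \mathcal{M}$ is:

$$\frac{\int_B \sum_{z \in \mathcal{H}} \mathbb{1}\{f(z) = f(h)\} \, P^\pi_\mu(z) \, d\xi(\mu)}{\int_{\mathcal{M}} \sum_{z \in \mathcal{H}} \mathbb{1}\{f(z) = f(h)\} \, P^\pi_\mu(z) \, d\xi(\mu)} = \xi(B \mid h, \pi), \tag{A.3}$$

due to (A.3). □

Proof of Theorem 2.1. For notational simplicity, we introduce $\phi(\cdot) = \int_{\mathcal{M}} P^\pi_\mu(\cdot) \, d\xi(\mu)$ for the marginal prior measure on $\mathcal{H}$, also omitting the dependency on $\pi$. Then the ABC posterior $\xi_\epsilon(B \mid h)$ equals:

$$\frac{\int_B \sum_{z \in \mathcal{H}} \mathbb{1}\{\|f(z) - f(h)\| \le \epsilon\} \, P_\mu(z) \, d\xi(\mu)}{\int_{\mathcal{M}} \sum_{z \in \mathcal{H}} \mathbb{1}\{\|f(z) - f(h)\| \le \epsilon\} \, P_\mu(z) \, d\xi(\mu)} = \frac{\int_B P_\mu(A^h_\epsilon) \, d\xi(\mu)}{\int_{\mathcal{M}} P_\mu(A^h_\epsilon) \, d\xi(\mu)} = \frac{\int_B P_\mu(A^h_\epsilon) \, d\xi(\mu)}{\phi(A^h_\epsilon)}.$$

From Definition 2.1

Equality (a) follows from equations (A.3) and (A.4). Inequality (b) follows from the fact that $P_\mu(A^h_\epsilon) = \sum_{z \in A^h_\epsilon} P_\mu(z) \ge \min_{z \in A^h_\epsilon} P_\mu(z)$, while (c) follows from $|x| \ge x$. For (d), first note that for any $z \in A^h_\epsilon$, by the definition of $A^h_\epsilon$, $\left|\ln\left[P_\mu(h) / P_\mu(z)\right]\right| \le L\epsilon$, by Assumption 2.1. Thus the first integral is bounded by $\int_{\mathcal{M}} d\xi(\mu) = \xi(\mathcal{M}) = 1$. Similarly, the $|\cdot|$ term in the second integral is independent of $\mu$ and so is taken out. For (e), the same assumption gives that $\phi(z) = \int_{\mathcal{M}} P_\mu(z) \, d\xi(\mu) \le \exp(L\epsilon)\,\phi(h)$ for any $z \in A^h_\epsilon$, so $\ln\left[\phi(A^h_\epsilon) / \phi(h)\right] \le L\epsilon \ln |A^h_\epsilon|$. Finally, as $h \in A^h_\epsilon$, $\phi(A^h_\epsilon) \ge \phi(h)$ and we obtain the final result. □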